(57)Abstract:

PROBLEM TO BE SOLVED: To retrieve similar documents quickly, without a severe
deterioration of retrieval accuracy, in a similar document retrieval method that
does not produce the feature vector of a registered document in the registration
mode and instead calculates the resemblance between a master (seed) document and
each registered document in the retrieval mode by referring to a full text
retrieval index.
SOLUTION: This similar document retrieval method includes a full text retrieval
index production process as the document registering process, and a master
document feature vector production process and a resemblance calculation process
as the similar document retrieval processes. In this method, a retrieval word
extraction process is added after the master document feature vector production
process.
* NOTICES *

Japan Patent Office is not responsible for any 
damages caused by the use of this translation. 

1. This document has been translated by computer, so the translation may not reflect the original
precisely.

2. **** marks a word which could not be translated.
3. Words in the drawings are not translated.



DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Field of the Invention] This invention relates to a method of searching a document database for
documents whose contents are similar to the contents of a document specified by a user.
[0002] 

[Description of the Prior Art] In recent years, with the spread of personal computers, the Internet, etc.,
the number of electronic documents has been increasing explosively and is expected to keep increasing at
an accelerating pace. In this situation, there is a growing demand to search quickly and efficiently for
documents that contain the information a user wants.

[0003] As a technique which meets this demand, similar document retrieval is attracting attention: the
user presents a document containing the desired contents (hereafter called a seed document), and
documents similar to that document are retrieved.
[0004] One approach to similar document retrieval is disclosed, for example, in "JP,11-66086,A"
(hereafter called the conventional technique 1).

[0005] When a document is registered in a document database, the conventional technique 1 creates the
information required for full-text search of the document being registered (the conventional technique 1
calls it the transposition index, i.e. an inverted index; hereafter it is called the index for full-text
searches). At the time of retrieval of a similar document, it refers to this index for full-text searches
to create, for each registered document (hereafter called a registration document), a vector whose
elements are the frequency-of-occurrence information of the words contained in that document (hereafter
called a feature vector). It then computes, as the similarity between documents, the cosine of the angle
formed in vector space by this feature vector and the feature vector of the document specified as the
retrieval condition (the seed document).

[0006] Hereafter, the procedure of the conventional technique 1 is explained using the PAD (Problem
Analysis Diagram) of drawing 2.

[0007] In the conventional technique 1, step 200 first judges whether the requested processing is
registration of a document or retrieval of a similar document. When it is judged to be registration of a
document, the index creation step 210 for full-text searches is performed and the index for full-text
searches is created.

[0008] When step 200 judges the processing to be retrieval of a similar document, the seed document
feature-vector generation step 220 is performed and a feature vector is created for the seed document.
Then the similarity calculation step 221 using the index for full-text searches is performed, and the
cosine of the angle formed in vector space by the feature vector of the seed document and the feature
vector of each registration document is computed as the similarity between documents.
[0009] The above is the procedure of the conventional technique 1.

[0010] Hereafter, the outline of this conventional technique 1 is explained using drawing 3.
[0011] In the document registration processing of the conventional technique 1, the index creation
processing 210 for full-text searches first extracts the words contained in the documents for
registration, document 1 and document 2, together with their appearance locations, and creates the index
403 for full-text searches. As a result, an entry such as "construction: (document 1, 5) (document 2, 8)"
is recorded in the index 403 for full-text searches. Here, "construction: (document 1, 5) (document 2,
8)" means that the character string "construction" appears starting at the 5th character of document 1
and at the 8th character of document 2.
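
As a rough illustration of the kind of entry described in [0011], the following Python sketch builds a
word-to-location index. The function name `build_position_index`, the toy English documents, and the
fixed vocabulary are assumptions made only for this example; they are not taken from the conventional
technique 1, which works on the original Japanese text.

```python
from collections import defaultdict


def build_position_index(documents, vocabulary):
    """Map each vocabulary word to a list of (doc_id, 1-based character position)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for word in vocabulary:
            start = text.find(word)
            while start != -1:
                # str.find is 0-based; the description counts characters from 1.
                index[word].append((doc_id, start + 1))
                start = text.find(word, start + 1)
    return dict(index)


# Hypothetical toy documents, chosen so that "construction" starts at
# character 5 of document 1 and character 8 of document 2.
documents = {
    1: "LAN construction and maintenance",
    2: "system construction service",
}
index = build_position_index(documents, ["construction", "maintenance"])
print(index["construction"])
```

The number of appearances of a word in a document is then simply the number of entries recorded for that
document.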

[0012] In the retrieval processing of a similar document, the seed document specified as the retrieval
condition is read, and the seed document feature vector 406 corresponding to this seed document is
generated by the seed document feature-vector generation processing 220.
[0013] Next, for every word contained in the seed document feature vector 406, the number of appearances
in each registration document is acquired by referring to the index 403 for full-text searches created by
said document registration processing.

[0014] Note that, as shown in drawing 4, the cosine of two vectors X and Y is obtained by dividing the
sum of the products of their corresponding components (for example, x(i) and y(i)) by the magnitude of
each vector, that is, cos(X, Y) = (sum over i of x(i) y(i)) / (|X| |Y|). Accordingly, rather than
computing the inner product separately for each pair of vectors, the inner product component for each
element of the vector (hereafter called the similarity according to element) can be calculated first, and
the total of the similarity according to element over all elements can then be computed. In drawing 4,
the i-th element of Vector X is written "x(i)" and the magnitude of Vector X is written "|X|".
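
The relation described for drawing 4 can be sketched in Python as follows. This is only a minimal
illustration under the assumption that both vectors are given as equal-length lists of numbers; the
function names are invented for the example.

```python
import math


def elementwise_similarity(x, y):
    """Similarity according to element: the product x(i) * y(i) for each i."""
    return [xi * yi for xi, yi in zip(x, y)]


def cosine(x, y):
    """Sum of the per-element products divided by the magnitude of each vector."""
    total = sum(elementwise_similarity(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return total / (norm_x * norm_y)


x = [3, 4, 0]
y = [4, 3, 0]
print(elementwise_similarity(x, y))
print(cosine(x, y))
```

Because the total is a plain sum, the per-element products can be accumulated one word at a time, which
is what allows the calculation to be driven by index lookups.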
[0015] That is, to compute the cosine between the seed document feature vector 406 and the feature
vector of each registration document in drawing 3, the product of the number of appearances in the seed
document and the number of appearances in each registration document is computed, for every word in the
seed document feature vector 406, as the similarity according to element of that word for that
registration document. The cosine can then be computed by taking, for every registration document, the
total of the similarity according to element over all words.

[0016] Hereafter, this similarity calculation approach is explained concretely using drawing 5.
[0017] Let the seed document feature vector be Vector X, the feature vector of document 1 (hereafter
called feature vector 1) be Vector Y, and the feature vector of document 2 (hereafter called feature
vector 2) be Vector Z. The 1st components of the inner products of the seed document feature vector with
feature vector 1 and with feature vector 2 can then be computed as "x(1) y(1)" and "x(1) z(1)",
respectively.

[0018] Here, "x(1)" is the number of appearances of word 1 in the seed document, and "y(1)" and "z(1)"
are the numbers of appearances of word 1 in document 1 and document 2, respectively.

[0019] That is, the number of appearances of word 1 within the seed document is obtained by counting,
and the number of appearances 600 of word 1 in each registration document can be acquired by referring to
the index for full-text searches corresponding to word 1.

[0020] By referring in the same way to the index for full-text searches corresponding to every word in
the seed document, the similarity of each registration document to the seed document can be computed.
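
A minimal Python sketch of this index-driven accumulation is given below. It assumes, purely for
illustration, that the index has already been reduced to a mapping from each word to the number of
appearances per registration document; the names `accumulate_similarity` and `count_index` and the sample
counts are hypothetical. The scores are the unnormalised totals of the similarity according to element,
i.e. the numerator of the cosine.

```python
from collections import defaultdict


def accumulate_similarity(seed_counts, count_index):
    """Add up seed_count * doc_count for every seed word, one index lookup per word."""
    scores = defaultdict(int)
    for word, seed_count in seed_counts.items():
        # One reference to the index per kind of word in the seed document.
        for doc_id, doc_count in count_index.get(word, {}).items():
            scores[doc_id] += seed_count * doc_count
    return dict(scores)


# Hypothetical appearance counts: word -> {document id: number of appearances}.
count_index = {
    "LAN": {1: 1},
    "construction": {1: 1, 2: 1},
    "maintenance": {1: 1, 2: 1},
}
seed_counts = {"LAN": 4, "construction": 3, "solution": 2}
scores = accumulate_similarity(seed_counts, count_index)
print(scores)
```

Note that the loop touches the index once for each kind of word in the seed document, so the cost grows
with the number of distinct seed words.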
[0021] The above is the concrete explanation of the similarity calculation approach in the conventional
technique 1.

[0022] Finally, the similarity 407 of each registration document as a whole is output.
[0023] The above is the outline of the conventional technique 1.

[0024] As explained above, the conventional technique 1 creates beforehand the index for full-text
searches for the words contained in the registration documents, which makes it possible to generate the
feature vectors of the registration documents at the time of document retrieval. By computing as the
similarity the cosine with the seed document feature vector corresponding to the seed document specified
as the retrieval condition, documents with similar contents can be retrieved from the document database.

[0025] However, because the conventional technique 1 refers to the index for full-text searches for
every word extracted from the seed document when calculating similarity, it has the problem that a huge
processing time is needed when the seed document contains many words.

[0026] For example, even if the index for full-text searches can be referred to in 0.5 seconds for one
kind of word in the seed document, a processing time as long as 50 seconds is required when 100 kinds of
words are extracted from the seed document.
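
The estimate in [0026] is a simple product, sketched here in Python. The function name is invented for
the example, and the model assumes that every kind of word costs the same lookup time and that lookups
are performed one after another.

```python
def estimated_lookup_time(num_word_kinds, seconds_per_lookup):
    """Total index reference time if each kind of word needs one lookup."""
    return num_word_kinds * seconds_per_lookup


# Figures from the example: 100 kinds of words at 0.5 seconds each.
print(estimated_lookup_time(100, 0.5))
```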

[0027] On the other hand, if the words of the seed document feature vector are simply thinned out in
order to reduce the processing time, the number of kinds of words is reduced, so even words which carry
important meaning in the seed document may be eliminated, and there is a possibility that retrieval
precision may fall extremely.
[0028] 

[Problem(s) to be Solved by the Invention] In view of the above problem, this invention aims at solving
the following technical problem.

[0029] That is, the technical problem of this invention is to realize a high-speed similar
document-retrieval approach by using the minimum number of words which can secure retrieval precision, in
a similar document-retrieval approach which does not create the feature vector of a registration document
at the time of document registration to the document database, but instead creates the feature vectors of
all registration documents at the time of retrieval of a similar document and so performs similarity
calculation using the newest word information.
[0030] 

[Means for Solving the Problem] The procedure of the similar document retrieval of this invention, which
solves the above-mentioned technical problem, is shown in the PAD of drawing 7.
[0031] The similar document-retrieval approach of this invention has the processing classification
judging processing 200, which judges whether registration processing or retrieval processing is
requested; the index creation processing 210 for full-text searches as the registration processing of a
document; and the seed document feature-vector generation processing 220 and the similarity calculation
processing 221 using the index for full-text searches as the retrieval processing of a similar document.
It is characterized by having the word extract processing 701 for retrieval between the seed document
feature-vector generation processing 220 and the similarity calculation processing 221 using the index
for full-text searches.

[0032] Namely, the similar document-retrieval approach of this invention has the following steps.
As the index creation processing 210 for full-text searches at the time of document registration to the
document database:
(Step 1) the registration document read step, which reads the document for registration;
(Step 2) the information file creation registration step for full-text searches, which extracts the
information for full-text searches from the text of the document for registration read in (step 1) and
stores it in the information file for full-text searches.
As the seed document feature-vector generation processing 220 in the retrieval processing of a similar
document:
(Step 3) the seed document read step, which acquires the seed document specified as the retrieval
condition;
(Step 4) the seed document analysis word extract step, which analyzes the seed document read in (step 3)
and extracts the words contained in it;
(Step 5) the step of counting appearances in the seed document, which counts the number of appearances
of each word extracted in (step 4).
As the word extract processing 701 for retrieval:
(Step 6) the word significance calculation step, which computes the significance of each word based on
the number of appearances counted in (step 5);
(Step 7) the word judging step for retrieval, which selects words in descending order of the
significance computed in (step 6), computes the similarity according to element of the selected word to
the seed document itself, and extracts the word as a word for retrieval when this similarity according to
element exceeds a predetermined threshold.
As the similarity calculation processing 221 using the index for full-text searches:
(Step 8) the similarity calculation step, which performs the following (step 9) to (step 10) using the
words for retrieval extracted from the seed document;
(Step 9) the step of acquiring the number of appearances of a word for retrieval, which acquires the
number of appearances of the word for retrieval in each registration document by referring to the
information for full-text searches created in (step 2);
(Step 10) the similarity calculation step classified by element, which computes the similarity
according to element between the seed document and each registration document for the word for
retrieval, using the number of appearances in the seed document counted in (step 5) and the number of
appearances in each registration document acquired in (step 9), and adds it to the similarity of each
registration document as a whole;
(Step 11) the retrieval result output step, which outputs the similarity computed in (step 10).

[0033] The principle of this invention using the above-mentioned similar document-retrieval approach is
explained using drawing 8 to drawing 10.

[0034] In the similar document-retrieval approach of this invention, (step 1) and (step 2) are performed
at the time of document registration to the document database.

[0035] Hereafter, the outline of the procedure for registering a document is explained using
drawing 8.

[0036] First, in (step 1), the document to be registered is read. In the example shown in drawing 8,
document 1 "a device required for construction of LAN, and operation and maintenance is offered." and
document 2 "it ties up with SI vendor which deals with construction and maintenance of an information
system." are read as the documents for registration.

[0037] Next, in (step 2), the information for full-text searches is extracted from the text of the
document for registration read in (step 1), and it is stored in the information file for full-text
searches.
[0038] In the example shown in drawing 8, (document 1, 1) is extracted as the information for full-text
searches corresponding to "L" contained in document 1, and it is stored in the information file 803 for
full-text searches. Here, L (document 1, 1) means that the alphabetic character "L" appears at character
position 1 of document 1.
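
To make the (document, character position) entries concrete, the Python sketch below indexes every single
character and then counts the appearances of an arbitrary character string from the index alone. It is
only an illustration under stated assumptions: the function names and toy documents are hypothetical, and
a single-character index is used for simplicity; it is not the specific n-gram index method of any cited
publication.

```python
from collections import defaultdict


def build_char_index(documents):
    """Record every character as character -> [(doc_id, 1-based position), ...]."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, ch in enumerate(text, start=1):
            index[ch].append((doc_id, pos))
    return index


def count_occurrences(index, query):
    """Count appearances of query per document using only the index entries."""
    if not query:
        return {}
    # Start from every position of the first character, then keep only the
    # starts whose following positions hold the following characters.
    candidates = set(index.get(query[0], []))
    for offset, ch in enumerate(query[1:], start=1):
        positions = set(index.get(ch, []))
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in positions}
    counts = defaultdict(int)
    for doc_id, _ in candidates:
        counts[doc_id] += 1
    return dict(counts)


# Hypothetical toy documents.
documents = {1: "LAN for a LAN", 2: "PLAN"}
index = build_char_index(documents)
print(index["L"][0])
print(count_occurrences(index, "LAN"))
```

Note that a purely character-based count also matches "LAN" inside "PLAN", which is one reason the choice
of word extraction and index method affects the counts that feed the similarity calculation.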

[0039] As the information for full-text searches used here, any method may be used as long as the number
of appearances of an arbitrary word or character string in each registration document can be acquired:
the word index method shown in the conventional technique 1 may be used, or the n-gram index method
disclosed in "JP,08-194718,A" may be used.
[0040] The above is the outline of the procedure for document registration in this invention.
[0041] Next, in the similar document-retrieval approach of this invention, (step 3) to (step 11) are
performed at the time of retrieval of a document.

[0042] Hereafter, the outline of the procedure for retrieving a document is explained using
drawing 9.

[0043] First, in (step 3), the seed document 901 "a solution is developed for the construction know-how
of a LAN system to arms ..." specified as the retrieval condition is read.

[0044] Then, in (step 4), the seed document is analyzed and the words contained in the seed document are
extracted. The seed document analysis processing used here may refer to a word dictionary and extract the
words contained in that dictionary, as shown in the conventional technique 1; it may use the word extract
approach based on statistical information in the document database, as disclosed in "JP,10-148721,A"; it
may mechanically extract the n-grams contained in the seed document; or it may use some other word
extract technique.

[0045] In the example shown in drawing 9, the word train 903 (LAN, construction, know-how, arms,
solution, expansion, ...) is extracted as the result of this seed document analysis processing.
[0046] Next, in (step 5), the number of appearances within the seed document of each word extracted in
(step 4) is counted, and the group 904 of words and their numbers of appearances ([LAN, 4],
[construction, 3], [know-how, 2], [arms, 1], [solution, 2], [expansion, 1], ...) is output.

[0047] Here, [LAN, 4] means that the word "LAN" appears 4 times.
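
The counting of (step 5) can be sketched with Python's `collections.Counter`. The order of the words in
`word_train` below is hypothetical, since the full seed document text is not given; only the resulting
counts are taken from the group 904.

```python
from collections import Counter

# Hypothetical sequence of extracted words whose counts match the group 904.
word_train = [
    "LAN", "construction", "know-how", "LAN", "solution", "construction",
    "LAN", "arms", "know-how", "solution", "construction", "LAN", "expansion",
]

counts = Counter(word_train)   # word -> number of appearances in the seed document
group = counts.most_common()   # (word, count) pairs, highest count first
print(group)
```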

[0048] Next, in (step 6), significance is computed for the group 904 of words and numbers of appearances
output in (step 5), and the group of words and significance is output. As the calculation approach of
this significance, the number of appearances in the seed document may be used as it is, or, for example,
the ratio of the number of documents in which the word appears to the number of documents registered in
the database (hereafter called the appearance ratio) may be used. In the example shown in drawing 9, the
number of appearances in the seed document 901 is used as the significance of a word, and the word
significance train 905 "[LAN, 4], [construction, 3], [solution, 2], ..." is output. Here, [LAN, 4] means
that the word "LAN" has the significance "4" in the seed document.
[0049] Then, in (step 7), the similarity according to element to the seed document itself is computed
for each word in descending order of the significance computed in (step 6), and when this similarity
according to element exceeds the predetermined threshold, the word is extracted as a word for retrieval.
As a result, the words for retrieval [LAN, 4] and [construction, 3] are extracted.
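
The selection in (step 7) can be sketched in Python as follows. This is a minimal illustration under
several assumptions: the significance is taken to be the number of appearances in the seed document, the
similarity according to element of a word to the seed document itself is taken to be the product of that
count with itself, the threshold of 5 is the value used in the drawing 10 example, and every word is
tested rather than stopping at the first word that fails. The function name is invented for the example.

```python
def select_retrieval_words(significance, seed_counts, threshold):
    """Keep words whose similarity to the seed document itself exceeds threshold."""
    selected = []
    ranked = sorted(significance.items(), key=lambda kv: kv[1], reverse=True)
    for word, _ in ranked:
        # Treat a copy of the seed document as a registration document, so the
        # per-element similarity is count-in-seed * count-in-copy.
        self_similarity = seed_counts[word] * seed_counts[word]
        if self_similarity > threshold:
            selected.append((word, seed_counts[word]))
    return selected


seed_counts = {"LAN": 4, "construction": 3, "know-how": 2,
               "arms": 1, "solution": 2, "expansion": 1}
significance = dict(seed_counts)  # assumption: significance = appearance count
retrieval_words = select_retrieval_words(significance, seed_counts, threshold=5)
print(retrieval_words)
```

With these inputs only "LAN" (16) and "construction" (9) pass the threshold, so only two index references
are needed in the later similarity calculation instead of one per kind of word in the seed document.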
[0050] Next, in - (step 8) (step 10), the similarity of each registration document to a seed document is 
computed by referring to the information file 803 for full-text searches created with the count of the 
appearance in a seed document of each word acquired above (step 7), and the above (step 2). 
[0051] And the similarity calculation result 906 is outputted in (step 11). 
[0052] The above is the outline of procedure of facing the document retrieval of this invention. 
[0053] the following - having mentioned above (step 7) — the extract procedure of the word for 
retrieval performed is explained using drawing 10 . 

[0054] First, in (step 7), the word significance train 905 outputted above (step 6) is read, and words
are chosen in descending order of significance. In drawing 10, [LAN, 4] is extracted first from the
word significance train 905 "[LAN, 4], [construction, 3], [solution, 2], ...".
[0055] Then, using the count of appearances "4" of the word for retrieval "LAN" in the seed document,
the similarity according to element which this word contributes to the similarity of the seed document
to the seed document itself is calculated. That is, it is assumed that a registration document identical
to the seed document exists (hereafter called a virtual registration document), the similarity
according to element of this word for retrieval between the seed document feature vector and the
feature vector of this virtual registration document is computed, and the total is computed.
[0056] In drawing 10, the product of the count of appearances "4" of the word for retrieval "LAN" in the
seed document and its count of appearances "4" in the virtual registration document is computed, and
the similarity according to element "16" is obtained.

[0057] Consequently, since the similarity according to element of the word for retrieval "LAN" to the
seed document itself is over the predetermined threshold (5 in the example shown in this figure), the
word is stored in a work area 170 as a word for retrieval.

[0058] Next, [construction, 3], which has the next highest significance after [LAN, 4], is chosen, and
the similarity according to element which this word contributes to the similarity of the seed document
to the seed document itself is calculated. Since the similarity according to element is 9 and is over
the predetermined threshold 5, the word is stored in a work area 170 as a word for retrieval.
[0059] Then [solution, 2], which has the next highest significance after [construction, 3], is chosen,
and its similarity according to element is calculated in the same way. Since the similarity according
to element is 4 and is not over the predetermined threshold, processing is ended without extracting
this word as a word for retrieval.
[0060] The above is the explanation of the extract procedure of the word for retrieval.
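
The extract procedure walked through above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_retrieval_words` and the dictionary inputs are hypothetical, and the similarity according to element is taken to be the product of the seed-document count and the virtual-registration-document count, as in the drawing 10 example.

```python
def select_retrieval_words(significance, seed_counts, threshold):
    """Pick words for retrieval in descending order of significance.

    The per-element similarity of the seed document to a virtual registration
    document identical to it is count * count. Processing stops at the first
    word whose per-element similarity is not over the threshold.
    """
    selected = []
    ranked = sorted(significance.items(), key=lambda kv: kv[1], reverse=True)
    for word, _sig in ranked:
        count = seed_counts[word]
        element_similarity = count * count  # seed count x virtual document count
        if element_similarity <= threshold:
            break  # not over the threshold: end without extracting this word
        selected.append(word)
    return selected


# Values from the example: significance equals the in-seed appearance count.
seed_counts = {"LAN": 4, "construction": 3, "solution": 2}
print(select_retrieval_words(seed_counts, seed_counts, threshold=5))
# -> ['LAN', 'construction']  (16 and 9 are over 5; 4 is not)
```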

[0061] As explained above, instead of creating a feature vector for each registration document at the
time of document registration to a document database, the index for full-text searches is created. At
the time of retrieval of a similar document, among the elements of the feature vector of the seed
document, words are extracted in order of significance within the seed document until the similarity
to the seed document itself converges, and only the extracted words are used as words for retrieval.
Compared with the case where all words are used for retrieval, it therefore becomes possible to compute
the similarity of a seed document and a registration document at high speed, without dropping
retrieval precision extremely.
[0062] 

[Embodiment of the Invention] Hereafter, the first example of this invention is explained using drawing 
1. 

[0063] The first example of the similar document-retrieval system which applied this invention consists
of a display 100, a keyboard 101, a central processing unit (CPU) 102, a magnetic disk drive 103, a
floppy disk drive (FDD) 104, main memory 105, and a bus 106 which connects these.
[0064] A magnetic disk drive 103 is one of the secondary storages, and stores the information file 180
for full-text searches.

[0065] The information stored in the floppy disk 107 through FDD 104 is read into main memory 105 or 
a magnetic disk drive 103. 

[0066] Main memory 105 stores a system control program 110, the registration control program 111, the
retrieval control program 112, the registration document read in program 120, the information file
creation registration program 121 for full-text searches, the retrieval condition analyzer 130, the
word extract program 131 for retrieval, the similarity calculation program 132, and the retrieval
result output program 133, and a work area 170 is secured in it.

[0067] The retrieval condition analyzer 130 consists of the seed document acquisition program 140, the
word extract program 142, and the program 143 for counting appearances in a seed document.

[0068] The word extract program 131 for retrieval consists of a word significance calculation program
150 and a word extract judging program 151 for retrieval.

[0069] The similarity calculation program 132 consists of a count acquisition program 161 for retrieval 
of a word appearance, and a similarity calculation program 162 classified by element. 
[0070] The registration control program 111 and the retrieval control program 112 are started by the
system control program 110 according to directions given by the user from the keyboard 101. The
registration control program 111 controls the registration document read in program 120 and the
information file creation registration program 121 for full-text searches, and the retrieval control
program 112 controls the retrieval condition analyzer 130, the word extract program 131 for retrieval,
the similarity calculation program 132, and the retrieval result output program 133.

[0071] In addition, although the registration control program 111 and the retrieval control program 112
are started in this example by a command inputted from the keyboard 101, they may also be started by a
command or event inputted through other input devices.

[0072] Moreover, it is also possible to store these programs in storage media such as a magnetic disk
drive 103, a floppy disk 107, MO, CD-ROM, and DVD (not shown in drawing 1), to read them into main
memory 105 through a drive unit, and to execute them by the CPU 102.

[0073] Hereafter, the procedure of the similar document-retrieval system in this example is explained.
[0074] First, the procedure of the system control program 110 is explained using the PAD diagram of
drawing 11.

[0075] The system control program 110 first analyzes, in step 1100, the command inputted from the
keyboard 101.

[0076] When, at step 1101, the command is analyzed to be a registration command, the registration
control program 111 is started at step 1102, and a document is registered.

[0077] When, at step 1101, the command is analyzed to be a retrieval command, the retrieval control
program 112 is started at step 1103, and similar documents are searched.

[0078] The above is the procedure of a system control program 110. 

[0079] Next, the procedure of the registration control program 111 started by the system control
program 110 at step 1102 shown in drawing 11 is explained using the PAD diagram of drawing 12.
[0080] In the registration control program 111, in step 1200, the registration document read in program
120 is started first, the document specified as a candidate for registration (hereafter called the
document for registration) is read, and it is stored in a work area 170.

[0081] Next, in step 1201, the information file creation registration program 121 for full-text searches is 
started, the information for full-text searches corresponding to the registration document stored in the 
work area 170 is created, and it stores in the information file 180 for full-text searches. 
[0082] The above is the procedure of the registration control program 111. 

[0083] Next, the procedure of the retrieval control program 112 started by the system control program
110 at step 1103 shown in drawing 11 is explained using the PAD diagram of drawing 13.
[0084] First, in step 1300, the retrieval control program 112 starts the retrieval condition analyzer
130, and extracts words from the seed document.

[0085] Next, in step 1301, the word extract program 131 for retrieval is started, the significance of the 
word extracted from the seed document in the above-mentioned step 1300 is computed, and a word with 
a high significance is extracted as a word for retrieval based on predetermined conditions. 
[0086] And in step 1302, the similarity calculation program 132 is started and the similarity of each 
registration document to a seed document is computed using the appearance information on the word for 
retrieval extracted from the seed document in the above-mentioned step 1301. 

[0087] And in step 1303, the retrieval result output program 133 is started and the similarity
calculation result computed at the above-mentioned step 1302 is outputted as a retrieval result.

[0088] Here, the retrieval result may be displayed on a display 100, or may be stored in a work area
170 or on a magnetic disk drive 103. Moreover, when outputting the similarity calculation result to a
display 100, the documents may be listed in descending order of similarity, or in ascending or
descending order of the management number given to each document.

[0089] The above is the procedure of the retrieval control program 112. 

[0090] Next, the procedure of the retrieval condition analyzer 130 started by the retrieval control
program 112 at step 1300 shown in drawing 13 is explained using the PAD diagram of drawing 14.
[0091] First, in step 1400, the retrieval condition analyzer 130 starts the seed document acquisition
program 140, acquires the seed document specified in the retrieval conditions, and stores it in a work
area 170.

[0092] Next, in step 1402, the word extract program 142 is started and words are extracted from the
seed document stored in the work area 170.

[0093] And in step 1403, the program 143 for counting appearances in a seed document is started, the
number of appearances within the seed document is counted for each word extracted at step 1402, and
the counts are stored in a work area 170.
[0094] The above is the procedure of the retrieval condition analyzer 130. 

[0095] Next, the procedure of the word extract program 131 for retrieval started by the retrieval
control program 112 at step 1301 shown in drawing 13 is explained using the PAD diagram of drawing 15.
[0096] First, in step 1500, the word extract program 131 for retrieval starts the word significance
calculation program 150, computes the significance of each word stored in the work area 170 based on
the predetermined formula, and stores it in a work area 170.

[0097] Next, steps 1502-1505 are repeated and performed to all the words stored in the work area 170 at 
said step 1500 (step 1501). 

[0098] First, in step 1502, the word stored in the work area 170 is acquired in descending order of 
significance. 

[0099] Next, in step 1503, the word extract judging program 151 for retrieval is started, and the 
similarity according to element of a seed document is computed. 

[0100] And in step 1504, it is judged whether the similarity according to element of the seed document
is over the predetermined threshold. When it is over the threshold, step 1505 is performed; when it is
not, the repeat processing is ended.

[0101] And in step 1505, this word is stored in a work area 170 as a word for retrieval.
[0102] The above is the procedure of the word extract program 131 for retrieval.
[0103] In addition, the similarity according to element of each word in the above-mentioned step 1503
may be computed using the count of appearances of each word in the seed document, as shown in the
conventional technique 1. As mentioned later, it may also use statistical information such as the
number of appearance documents of the word in the document database, and may further take into
consideration the appearance positional information within a document.
[0104] Next, the procedure of the similarity calculation program 132 started by the retrieval control
program 112 at step 1302 shown in drawing 13 is explained using the PAD diagram of drawing 16.
[0105] The similarity calculation program 132 repeats and performs steps 1602-1603 to all the words for 
retrieval stored in the work area 170 (step 1601). 

[0106] At step 1602, the count acquisition program 161 for retrieval of a word appearance is started, the 
count of an appearance within each registration document is acquired with reference to the information 
file 180 for full-text searches corresponding to the word for retrieval, and it stores in a work area 170. 
[0107] Next, in step 1603, the similarity calculation program 162 classified by element is started, the 
similarity according to element of the registration document to a seed document is computed by the 
predetermined formula using the count of the appearance in a seed document of the word for retrieval 
stored in the work area 170, and the count of the appearance in a registration document, and it adds to 
the similarity of the whole registration document. 

[0108] The above is the procedure of the similarity calculation program 132. 
[0109] The above is the first example of this invention.
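
The loop of steps 1601 to 1603 can be sketched as follows. This is a hedged sketch, not the implementation of programs 161 and 162: the index is assumed here to be a plain dictionary mapping each word to per-document appearance counts, standing in for the information file 180 for full-text searches, and the "predetermined formula" is assumed to be the product of the two counts used in the earlier example.

```python
from collections import defaultdict


def compute_similarities(retrieval_words, seed_counts, index):
    """Accumulate per-element similarities into a whole-document similarity.

    index: word -> {document number: appearance count} (assumed layout).
    """
    similarities = defaultdict(int)
    for word in retrieval_words:                      # step 1601: repeat per word
        postings = index.get(word, {})                # step 1602: look up counts
        for doc_id, doc_count in postings.items():
            # step 1603: per-element similarity, added to the document total
            similarities[doc_id] += seed_counts[word] * doc_count
    return dict(similarities)


# Toy index invented for illustration.
index = {"LAN": {1: 2, 2: 1}, "construction": {1: 1}}
seed_counts = {"LAN": 4, "construction": 3}
print(compute_similarities(["LAN", "construction"], seed_counts, index))
# -> {1: 11, 2: 4}
```

Because only the postings of the words for retrieval are read, the cost grows with the number of selected words rather than with the size of the seed document's full vocabulary.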

[0110] In addition, although in this example words are extracted from a seed document by the retrieval
condition analyzer 130, n-grams may be extracted instead of words. In this case, the unit processed by
the word extract program 131 for retrieval is also the n-gram.
[0111] Moreover, although step 1504 of the word extract program 131 for retrieval judges whether the
similarity according to element of the seed document computed at step 1503 exceeds a predetermined
threshold, it may instead judge whether the total of the similarity, rather than the similarity
according to element, is over a predetermined threshold, or whether the ratio of the computed
similarity to the total of the similarity according to element over all the words extracted from the
seed document is over a predetermined threshold.
[0112] Moreover, although the count of appearances of a word was used directly for calculation of the
similarity of each registration document to a seed document in this example, it will be clear that
this may be further normalized by, for example, the length of the seed document or of the registration
document.

[0113] As explained above, according to the first example of this invention, the number of words for
retrieval used for similarity calculation is reduced using the value of the similarity according to
element to the seed document as a guide, so processing can be terminated with the minimum retrieval
necessary for the similarity calculation result of the seed document to converge.

[0114] As a result, the number of words for retrieval can be reduced without reducing retrieval
precision extremely, and a high-speed similar document retrieval can be realized.

[0115] In addition, although the document for registration and the seed document were treated as
documents in this example, it will be clear that they may be texts or character strings.

[0116] Moreover, although in the first example of this invention explained above the word extract
program 131 for retrieval reduces the words for retrieval using the value of the similarity according
to element of the seed document as a guide, it may instead extract a number of words for retrieval
specified beforehand. In this case, the number of words for retrieval may be set by using a test
pattern prepared beforehand to determine the number with which retrieval is completed within a
predetermined time.

[0117] Next, the second example of this invention is explained using drawing 17.

[0118] The second example of the similar document-retrieval system which applied this invention uses
the statistical information of the registration documents accumulated in the document database when
computing the significance of the words extracted from the seed document.

[0119] According to this approach, the word significance calculation performed by the word
significance calculation program 150 in the first example can use not only the appearance information
in the seed document but also the appearance information of the whole document database. It becomes
possible to adjust the significance of words which appear frequently within the document database,
and word significance can be computed with higher precision than in the first example.
[0120] Although this example takes the almost same configuration as the first example ( drawing 1 ), the 
configurations of the word significance calculation program 150 differ, and as shown in drawing 17 , the 
statistical information reference program 1700 is added. 

[0121] Hereafter, the procedure of the word significance calculation program 150a, which differs from
the first example, is explained using drawing 18.

[0122] In step 1800, the word significance calculation program 150a first starts the statistical
information reference program 1700 and, referring to the information file 180 for full-text searches,
acquires the number of appearance documents in the document database of each word extracted from the
seed document as the statistical information of this word.

[0123] In addition, the number of appearance documents of a word can be acquired from the information
file 180 for full-text searches as follows. As shown for the information file 803 for full-text
searches in drawing 8, the publication number and appearance location of each word are stored in the
information file 180 for full-text searches, so the number of appearance documents is obtained by
counting the distinct publication numbers in which the word appears.
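
Counting the distinct publication numbers can be sketched as follows. The posting layout, a list of (publication number, appearance location) pairs per word, is an assumption made for illustration; the specification does not fix the file format.

```python
def number_of_appearance_documents(postings):
    """Return how many distinct documents a word appears in.

    postings: iterable of (publication number, appearance location) pairs
    for one word (assumed layout of the full-text search information).
    """
    return len({doc_id for doc_id, _location in postings})


# Toy postings: the word appears twice in document 1 and once in document 3.
print(number_of_appearance_documents([(1, 0), (1, 5), (3, 2)]))  # -> 2
```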

[0124] And in step 1801, the significance of each word extracted from the seed document is computed 
using the statistical information in the count of the appearance in a seed document and document 
database of this word, and it stores in a work area 170. 

[0125] The above is the procedure of word significance calculation program 150a. 
[0126] In addition, as the word significance formula in this example, the TF-IDF (Term Frequency,
Inverse Document Frequency) method may be used, for example.
[0127] The above is the second example of this invention. 
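
One common form of TF-IDF weighting is sketched below as a possible word significance formula. The specification names the method only as an example and gives no formula, so the exact weighting (count times the natural logarithm of total documents over appearance documents), the function name, and the inputs are all assumptions.

```python
import math


def tfidf_significance(seed_counts, doc_freq, total_docs):
    """Significance = count in seed document * log(total docs / appearance docs).

    Assumes every word has doc_freq >= 1, i.e. it appears in at least one
    registered document.
    """
    return {
        word: count * math.log(total_docs / doc_freq[word])
        for word, count in seed_counts.items()
    }


# Toy statistics invented for illustration: "LAN" appears in every document,
# so its significance drops to zero despite its high in-seed count.
seed_counts = {"LAN": 4, "construction": 3}
doc_freq = {"LAN": 100, "construction": 10}
print(tfidf_significance(seed_counts, doc_freq, total_docs=100))
```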

[0128] As explained above, by using the similar document-retrieval system in the second example of
this invention, word significance can be computed in consideration of words which appear frequently
within a document database (hereafter called frequent appearance words). Namely, by setting the word
significance of a frequent appearance word low and the word significance of a rare word high, it
becomes possible to preferentially choose the words which show the features of a seed document, and a
highly precise similar document retrieval can be realized.
[0129] Next, the third example of this invention is explained using drawing 19 . 
[0130] Like the second example, the third example of the similar document-retrieval system which
applied this invention uses the statistical information of the registration documents accumulated in
the document database when computing the significance of the words extracted from the seed document,
but it differs in that the statistical information file 1910 is used for acquisition of the
statistical information.

[0131] According to this approach, the acquisition of the statistical information referred to in the
word significance calculation of the second example can be performed at high speed.
[0132] Although this example takes almost the same configuration as the second example (drawing 17),
the configuration of the registration control program 111 differs, and as shown in drawing 19, the
statistical information file creation registration program 1900 is added. Moreover, the statistical
information file 1910 is stored in a magnetic disk drive 103. At step 1800 of said word significance
calculation program 150a, when the statistical information in the document database of each word
extracted from the seed document is acquired, the statistical information file 1910 shown in drawing
19 is referred to instead of the information file 180 for full-text searches.
[0133] Hereafter, the procedure of the registration control program 111a, which differs from the
second example, is explained using drawing 20.

[0134] In the registration control program 111a, in step 1200, the registration document read in
program 120 is started first, the document specified as a candidate for registration is read, and it
is stored in a work area 170.
[0135] Next, in step 1201, the information file creation registration program 121 for full-text
searches is started, the information for full-text searches corresponding to the registration
document stored in the work area 170 is created, and it is stored in the information file 180 for
full-text searches.

[0136] Next, in step 2000, the statistical information file creation registration program 1900 is
started, the statistical information corresponding to the registration document stored in the work
area 170 is created, and it is stored in the statistical information file 1910.

[0137] The above is the procedure of the registration control program 111a.

[0138] Drawing 21 shows an example of the statistical information file 1910 created by the statistical
information file creation registration program 1900.

[0139] The management number 2100, the word 2101, and the number of appearance documents 2102 are
stored in the statistical information file 1910 shown in this figure.

[0140] The example shown in this figure shows that the word "LA" is stored in the field of management
number "0", and that the number of appearance documents of this word is stored as "1."
[0141] In addition, although the statistical information file 1910 is stored in a tabular format in
the example shown in drawing 21, any format may be used as long as a word and its number of
appearance documents can be acquired from it. For example, it may be stored in a trie format, or
stored in the head field of the information file 180 for full-text searches.

[0142] The above is the third example of this invention. 

[0143] As explained above, according to the third example of this invention, the statistical
information of each word extracted from the seed document is acquired by referring to the statistical
information file created beforehand at the time of document registration processing. It therefore
becomes unnecessary to count the number of distinct publication numbers with reference to the
information for full-text searches, and statistical information can be acquired at high speed.
Thereby, a higher-speed similar document retrieval can be realized compared with the second example.

[0144] Next, the fourth example of this invention is explained using drawing 22 . 
[0145] The fourth example of the similar document-retrieval system which applied this invention 
approximates and uses the statistical information of each word extracted from the seed document. 
[0146] According to this approach, the capacity of the statistical information stored in the statistical 
information file 1910 in the third example can be reduced, without reducing the precision of statistical 
information extremely. 

[0147] Although this example takes the almost same configuration as the third example ( drawing 19 ), 
the configurations of the statistical information reference program 1700 differ, and the approximation 
statistical information calculation program 2200 is added. 

[0148] Hereafter, the procedure of the statistical information reference program 1700b, which differs
from the third example, is explained using drawing 23.

[0149] Statistical information reference program 1700b repeats and performs steps 2301-2304 about all 
the words extracted from the seed document (step 2300). 

[0150] At step 2301, it checks whether the statistical information corresponding to this word is stored 
with reference to the statistical information file 1910. 

[0151] And when this word is stored in the statistical information file 1910, step 2303 is performed, and 
step 2304 is performed when not stored (step 2302). 

[0152] At step 2303, the statistical information of this word is acquired with reference to the statistical 
information file 1910. 

[0153] Moreover, at step 2304, the approximation statistical information calculation program 2200 is
started, and the approximation statistical information of this word is computed.

[0154] The above is the procedure of statistical information reference program 1700b.

[0155] Next, the procedure of the approximation statistical information calculation program 2200 is
concretely explained using drawing 24.

[0156] In the example shown in this figure, first, in step 2301, the statistical information file 1910
is referred to for the word 2400 "LAN", which is the object for which statistical information is
acquired.

[0157] Here, since "LAN" is not stored in the statistical information file 1910, step 2304 is performed.
[0158] At step 2304, the statistical information of "LA" and "AN", which are the components of the
word 2400 "LAN", is acquired, respectively, and the smaller of these numbers of appearance documents
is set as the statistical information of "LAN".

[0159] In the example shown in this figure, the number of appearance documents "807" stored as the
statistical information 2401 of "LA" is compared with the number of appearance documents "1512" stored
as the statistical information 2402 of "AN", and as a result the smaller value, the number of
appearance documents "807" of "LA", is stored as the statistical information 2403 of "LAN" (2410).

[0160] This uses the property that, when the numbers of appearance documents of "LA" and "AN", the
components of the word "LAN", differ, the number of appearance documents of "LAN" cannot be larger
than that of either component. That is, although the number of appearance documents of "LAN" itself
should essentially be used as the number of appearance documents of the word "LAN", the smaller of
the numbers of appearance documents of its components "LA" and "AN" is used as an approximation.
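
Steps 2301 to 2304 can be sketched as follows, using the values from the drawing 24 example. The function name, the dictionary standing in for the statistical information file 1910, and the choice of two-character components (matching "LA" and "AN") are assumptions for illustration.

```python
def approximate_appearance_documents(word, stats, n=2):
    """Look up a word's number of appearance documents, approximating if absent.

    stats: word or component -> number of appearance documents (assumed to
    stand in for the statistical information file).
    If the word itself is not stored, the smallest value among its stored
    n-character components is returned, since the word cannot appear in more
    documents than any of its components. Returns None if nothing is stored.
    """
    if word in stats:                       # steps 2301-2303: stored as is
        return stats[word]
    components = [word[i:i + n] for i in range(len(word) - n + 1)]
    known = [stats[c] for c in components if c in stats]
    return min(known) if known else None    # step 2304: approximation


stats = {"LA": 807, "AN": 1512}
print(approximate_appearance_documents("LAN", stats))  # -> 807
```

The approximation is an upper bound on the true value, so a word estimated this way is never treated as rarer than it really is.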

[0161] The above is the concrete procedure of the approximation statistical information calculation 
program 2200. 

[0162] The above is the fourth example of this invention. 

[0163] As explained above, by using the similar document-retrieval system in the fourth example of
this invention, it becomes unnecessary to store the number of appearance documents of every word in
the statistical information file, so the capacity of the statistical information file can be reduced
compared with the third example.

[0164] As explained above, in the similar document-retrieval systems of the first through fourth
examples of this invention, the similarity of the seed document to itself is computed and the number
of words for retrieval is adjusted based on it, so a similar document retrieval can be realized at
high speed while securing retrieval precision.

[0165] Next, the fifth example of this invention is explained using drawing 25 . 

[0166] The fifth example of the similar document-retrieval system which applied this invention outputs
a retrieval result within a predetermined retrieval time.

[0167] According to this approach, since a user can acquire a retrieval result within a predetermined
retrieval time, the user can judge without stress whether the seed document specified in the
retrieval conditions suits the purpose of retrieval.

[0168] Although this example takes the almost same configuration as the first example ( drawing 1 ), the 
configurations of the similarity calculation program 132 differ and the retrieval processing-time 
measurement program 2500 is added. 

[0169] Hereafter, the procedure of the similarity calculation program 132b, which differs from the
first example, is explained using the PAD diagram of drawing 26.

[0170] In step 2600, similarity calculation program 132b starts the retrieval processing-time
measurement program 2500, and starts measurement of the retrieval processing time.
[0171] Next, steps 1602, 1603, and 2602 are repeated and performed for all the words for retrieval
stored in the work area 170, as long as the retrieval processing time is at or below a predetermined
value (hereafter called the retrieval time limit) (step 2601).

[0172] At step 1602, the count acquisition program 161 for retrieval of a word appearance is started,
the count of an appearance within each registration document is acquired with reference to the
information file 180 for full-text searches corresponding to the word for retrieval, and it is stored
in a work area 170.
[0173] Next, in step 1603, the similarity calculation program 162 classified by element is started,
the similarity according to element of the registration document to a seed document is computed by
the predetermined formula using the count of the appearance in a seed document of the word for
retrieval stored in the work area 170 and the count of the appearance in a registration document,
and it is added to the similarity of the whole registration document.

[0174] And in step 2602, the retrieval processing-time measurement program 2500 is started, the elapsed 
time of the retrieval processing time is measured, and the retrieval processing time is computed. 
[0175] The above is the procedure of similarity calculation program 132b. 
[0176] The above is the fifth example of this invention.
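
The time-limited loop of steps 2600 to 2602 can be sketched as follows. This is a sketch under assumptions: the index is a dictionary of per-document counts, the per-element similarity is the product of the two counts, and the injectable `clock` parameter is a hypothetical convenience for testing, not something the specification describes.

```python
import time
from collections import defaultdict


def compute_similarities_within_limit(retrieval_words, seed_counts, index,
                                      time_limit_s, clock=time.monotonic):
    """Accumulate similarities word by word until the time limit is exceeded.

    Returns the similarities and the list of words actually used.
    """
    start = clock()                                    # step 2600: start timing
    similarities = defaultdict(int)
    used = []
    for word in retrieval_words:                       # step 2601: repeat
        if clock() - start > time_limit_s:             # step 2602: elapsed time
            break                                      # limit exceeded: stop
        for doc_id, doc_count in index.get(word, {}).items():   # step 1602
            similarities[doc_id] += seed_counts[word] * doc_count  # step 1603
        used.append(word)
    return dict(similarities), used


index = {"LAN": {1: 2}, "construction": {1: 1}}
seed_counts = {"LAN": 4, "construction": 3}
result, used = compute_similarities_within_limit(
    ["LAN", "construction"], seed_counts, index, time_limit_s=1.0)
print(result, used)
```

Since the words for retrieval are processed in descending order of significance, stopping early drops the least significant words first.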

[0177] In addition, the retrieval time limit in step 2601 of this example is good also as what is specified 
as retrieval conditions at the time of retrieval activation, and good also as what is beforehand set up as a 
system construction value. 

[0178] Moreover, since it thinks also when only a small number of word for retrieval is used depending 
on the set point, you may enable it to set up the minimum number for retrieval of words for maintaining 
retrieval precision in this example, although the retrieval time limit shall be set up. In this case, even if 
the retrieval processing time exceeds the retrieval time limit, the specified minimum number for 
retrieval of words will repeat similar retrieval. 

[0179] Furthermore, although in this example the retrieval processing-time measurement program 2500 
measures the time taken by similarity calculation processing, the time taken by the retrieval 
processing itself may be measured instead. In this case, rather than starting measurement of retrieval 
time at step 2600 shown in drawing 26, it suffices to start the retrieval processing-time measurement 
program 2500, and begin measurement of the retrieval processing time, before the retrieval control 
program 112 starts the retrieval condition analyzer 130. 

[0180] As explained above, in the similar document-retrieval system of the fifth example of this 
invention, the number of words for retrieval is adjusted based on the time which retrieval takes, so 
a retrieval result can be acquired within the predetermined processing time. 
[0181] As a result, a user can predict when retrieval will end. 

[0182] In addition, it is also possible to switch, at the time of retrieval activation or by system 
definition, between the similar document-retrieval system explained in the first to fourth examples, 
which ends retrieval using the similarity of the seed document as the standard, and the similar 
document-retrieval system explained in the fifth example, which ends retrieval using the retrieval time 
as the standard. 

[0183] Next, the sixth example of this invention is explained using drawing 27. 

[0184] The sixth example of the similar document-retrieval system to which this invention is applied 
presumes the retrieval time from the words for retrieval extracted from the seed document and, when 
retrieval would require a huge amount of time, asks the user for confirmation. 

[0185] According to this approach, when retrieval under the word extraction condition for retrieval of 
the similar document-retrieval system explained in the first to fourth examples would take a huge 
amount of time, retrieval can be canceled in advance, so the user is not kept waiting needlessly. 

[0186] Although this example has almost the same configuration as the first example (drawing 1), the 
configuration of the word extract program 131 for retrieval differs: as shown in drawing 27, the 
retrieval time presumption check program 2700 is added. 

[0187] Hereafter, the procedure of the word extract program 131b for retrieval, which differs from that 
of the first example, is explained using the PAD diagram of drawing 28. 

[0188] In the word extract program 131 for retrieval, first, in step 1500, the word significance 
calculation program 151 is started, the significance of each word stored in the work area 170 is 
computed by the predetermined formula, and the result is stored in the work area 170. 

[0189] Next, steps 1502-1505 are repeated for all the words stored in the work area 170 at said 
step 1500 (step 1501). 

[0190] First, in step 1502, the words stored in the work area 170 are acquired in descending order of 
significance. 

[0191] Next, in step 1503, the word extract judging program 151 for retrieval is started, and the 
per-element similarity of the seed document is computed. 

[0192] Then, in step 1504, it is judged whether the per-element similarity of the seed document is over 
the predetermined threshold. If it is over, processing proceeds to step 1505; if it is not over, the 
repeat processing is ended. 

[0193] Then, in step 1505, this word is stored in the work area 170 as a word for retrieval. 
[0194] Next, in step 2800, the retrieval time is presumed from the words for retrieval stored in the 
work area 170. When the presumed retrieval time exceeds a predetermined value (assignment retrieval 
time), a message asking whether to continue retrieval is displayed and the user's confirmation is 
received. As this confirmation message, a message 2900 which has the continuation button 2901 and the 
Cancel button 2901 may be displayed, as shown, for example, in drawing 6. 

[0195] The above is the procedure of word extract program 131b for retrieval. 
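
The extraction loop of steps 1500-1505 can be sketched as follows. The text leaves both 
"predetermined formulas" unspecified, so two assumptions are made here: significance is taken to be the 
appearance count in the seed document, and the per-element similarity of the seed document is taken to 
be the word's squared count divided by the sum of squared counts. All names are hypothetical. 

```python
def extract_retrieval_words(seed_counts, threshold):
    """Return words for retrieval, chosen in descending order of significance."""
    # Step 1500: significance of each word.
    # ASSUMPTION: significance = appearance count in the seed document.
    significance = dict(seed_counts)
    norm = sum(count * count for count in seed_counts.values())
    selected = []
    # Steps 1501-1502: visit the words in descending order of significance.
    for word in sorted(significance, key=significance.get, reverse=True):
        # Step 1503: per-element similarity of the seed document.
        # ASSUMPTION: squared count normalised by the sum of squared counts.
        element_similarity = seed_counts[word] ** 2 / norm
        # Step 1504: end the repeat processing once the threshold is not exceeded.
        if element_similarity <= threshold:
            break
        # Step 1505: keep this word as a word for retrieval.
        selected.append(word)
    return selected


# Hypothetical seed document counts.
seed_counts = {"retrieval": 4, "document": 2, "the": 1, "of": 1}
words = extract_retrieval_words(seed_counts, threshold=0.1)
```

Raising the threshold shortens the list of words for retrieval, which trades retrieval precision for 
speed. 
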
[0196] In addition, the assignment retrieval time in the above-mentioned step 2800 may be specified as 
a retrieval condition, may be specified beforehand as system definition, or may be set up automatically 
from the results of some test patterns. 

[0197] Moreover, the retrieval time in the above-mentioned step 2800 may be presumed from the number of 
documents in which this word for retrieval appears, or from the size of the information file 180 for 
full-text searches corresponding to this word for retrieval. Alternatively, the mean time which one 
word for retrieval takes may be measured using some test patterns, and the retrieval time presumed 
using this mean time. 
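
As a rough sketch of step 2800, the first presumption approach (number of appearance documents) might 
look like the following. The linear cost model, the per-document cost in milliseconds, and the 
`confirm` callback standing in for the message 2900 are assumptions made for illustration. 

```python
def estimate_retrieval_time_ms(retrieval_words, doc_frequency, ms_per_document):
    """Presume retrieval time from the number of appearance documents."""
    # ASSUMPTION: cost is proportional to the total number of documents
    # in which the words for retrieval appear.
    postings = sum(doc_frequency.get(word, 0) for word in retrieval_words)
    return postings * ms_per_document


def should_continue(estimated_ms, specified_ms, confirm):
    """Ask the user only when the presumed time exceeds the specified time."""
    if estimated_ms <= specified_ms:
        return True
    # `confirm` is a hypothetical callback: True = continue, False = cancel.
    return confirm(estimated_ms)


# Hypothetical figures: word -> number of registration documents containing it.
doc_frequency = {"retrieval": 1000, "document": 5000}
estimate = estimate_retrieval_time_ms(["retrieval", "document"],
                                      doc_frequency, ms_per_document=2)
proceed = should_continue(estimate, specified_ms=10000,
                          confirm=lambda ms: False)  # user presses Cancel
```

The other two approaches in the text would replace only the body of the estimating function, for 
example by multiplying the number of words for retrieval by a measured mean time per word. 
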
[0198] As explained above, in the similar document-retrieval system shown in this example, the 
retrieval time is presumed from the extracted words for retrieval, and when the presumed retrieval time 
exceeds the time specified beforehand, the extraction condition of the words for retrieval can be 
adjusted. The user is therefore not kept waiting needlessly. 
[0199] 

[Effect of the Invention] As explained above, in this invention, the number of words for retrieval is 
set using the similarity of the seed document as the standard, so the number of words for retrieval 
used for similarity calculation can be reduced. Thereby, high-speed similar document retrieval which 
secures retrieval precision can be realized. 



* NOTICES * 

Japan Patent Office is not responsible for any 
damages caused by the use of this translation. 

1. This document has been translated by computer, so the translation may not reflect the original 
precisely. 

2. **** shows a word which cannot be translated. 
3. In the drawings, no words are translated. 



CLAIMS 



[Claim(s)] 

[Claim 1] A similar document-retrieval approach which searches the documents, texts, or character 
strings (hereafter collectively called documents) registered in a document database for documents 
whose contents are similar to a specified document (hereafter called a seed document), the approach 
being characterized by having: an index creation step for full-text searches which, as registration 
processing of a document into the document database, creates the index for full-text searches of the 
document to be registered; a seed document feature-vector creation step which, as retrieval processing 
of a similar document, creates vector data (hereafter called a seed document feature vector) whose 
elements are the appearance counts of each character string contained in the specified seed document; 
a character string extract step for retrieval which computes, for each character string which is an 
element of said seed document feature vector, the degree to which the character string represents the 
central contents of the seed document (hereafter called character string significance), and extracts, 
in descending order of this character string significance and by predetermined extract criteria, the 
character strings used for similarity calculation (hereafter called character strings for retrieval); 
a similarity calculation step which, for the character strings for retrieval extracted at said 
character string extract step for retrieval, computes the similarity of each registration document to 
the seed document using the appearance information of the character string for retrieval within the 
seed document and its appearance information within the documents registered in the document database 
(hereafter called registration documents); and a retrieval result output step which outputs the 
similarity to the seed document of each registration document computed at said similarity calculation 
step. 

[Claim 2] The similar document-retrieval approach according to claim 1, characterized in that said 
similarity calculation step computes, for the character strings for retrieval extracted at said 
character string extract step for retrieval, the similarity of each registration document to the seed 
document using the appearance count of the character string for retrieval within the seed document and 
its appearance count within the registration document. 

[Claim 3] The similar document-retrieval approach according to claim 1, characterized in that said 
character string extract step for retrieval has: a character string significance calculation step 
which, for each character string which is an element of the seed document feature vector created at 
said seed document feature-vector creation step, makes its appearance count in the seed document the 
character string significance of that character string; and a character string judging step for 
retrieval which extracts a beforehand-specified number of character strings for retrieval in 
descending order of the character string significance computed at said character string significance 
calculation step. 

[Claim 4] The similar document-retrieval approach according to claim 3, characterized in that, as said 
character string judging step for retrieval, instead of extracting a beforehand-specified number of 
character strings for retrieval, a character string judging step for retrieval is used which takes the 
character strings used for similarity calculation in descending order of the character string 
significance computed at said character string significance calculation step, computes the similarity 
to the seed document obtained with each such character string, and extracts the character string as a 
character string for retrieval when this similarity is over the predetermined value. 
[Claim 5] The similar document-retrieval approach according to claim 1, characterized in that a 
retrieval processing-time measurement step which measures the time which retrieval takes is added as 
retrieval processing, and in said similarity calculation step, similarity calculation processing is 
ended when the retrieval processing time measured at said retrieval processing-time measurement step 
exceeds a predetermined value. 


